ABSTRACT

In this paper we present a new mapping algorithm for speech 
recognition that relates the features of simultaneous recordings 
of clean and noisy speech. The model is a piecewise-linear transformation
applied to the noisy speech feature. The transformation
is a set of multidimensional linear least-squares filters whose
outputs are combined using a conditional Gaussian model. The
algorithm was tested using SRI's DECIPHER™ speech recognition
system [1-5]. Experimental results show how the mapping is
used to reduce recognition errors when the training and testing 
acoustic environments do not match. 

1. INTRODUCTION 

In many practical situations an automatic speech recog- 
nizer has to operate in several different but well-defined acoustic 
environments. For example, the same recognition task may be 
implemented using different microphones or transmission chan- 
nels. In this situation it may not be practical to recollect a speech 
corpus to train the acoustic models of the recognizer. To alleviate 
this problem, we propose an algorithm that maps speech features 
between two acoustic spaces. The models of the mapping algo- 
rithm are trained using a small database recorded simultaneously 
in both environments. 

In the case of steady-state additive homogeneous noise, we
can derive an MMSE estimate of the clean speech filterbank
log-energy features using a model for how the features change in the
presence of this noise [6-7]. In these algorithms, the estimated 
speech spectrum is a function of the global spectral SNR, the 
instantaneous spectral SNR, and the overall spectral shape of the 
speech signal. However, after studying simultaneous recordings 
made with two microphones, we believe that the relationship 
between the two simultaneous features is nonlinear. We therefore 
propose to use a piecewise-linear model to relate the two feature
spaces. 

There have been several algorithms in the literature which 
have focused on experimentally training a mapping between the 
noisy features and the clean features [8-13]. The proposed algo- 
rithm differs from previous algorithms in several ways: 

• The MMSE estimate of the clean speech features in noise is
trained experimentally rather than with a model as in [6, 7].

• Several frames are joined together, similar to [13].



• The conditional PDF is based on a generic noisy feature not 
necessarily related to the feature that we are trying to esti- 
mate. For example, we could condition the estimate of the 
cepstral energy on the instantaneous spectral SNR vector. 

• Multidimensional least-squares filters are used for the map- 
ping transformation. This is used to exploit the correlation 
of the features over time and among components of the 
spectral features at the same time. 

• Linear transformations are combined together without hard 
decisions. 

• All delta parameters are computed after mapping the cep- 
strum and cepstral energy. 

• The mapping parameters are trained using stereo record- 
ings with two different microphones. Once trained, the 
mapping parameters are fixed. 

• The mapping can be used to map either noisy speech fea- 
tures to clean features during training, or clean features to 
noisy features during recognition. 

2. THE POF ALGORITHM 

The mapping algorithm is based on a probabilistic piecewise-linear
transformation of the acoustic space that we call
Probabilistic Optimum Filtering (POF). Let us assume that the
recognizer is trained with data recorded with a high-quality
close-talking microphone (clean speech), and the test data is
acquired in a different acoustic environment (noisy speech). Our
goal is to estimate the clean feature vector $x_n$ given its corresponding
noisy feature $y_n$, where $n$ is the frame index. (A list of symbols
is shown in Table 1.) To estimate the clean vector we vector-quantize
the clean feature space into $I$ regions using the generalized
Lloyd algorithm [14]. Each VQ region is assigned a multidimensional
transversal filter (see Figure 1). The error between the
clean vector and the estimated vector produced by the i-th filter
is given by



$$e_{n,i} = x_n - W_i^T Y_n \qquad (1)$$



where $e_{n,i}$ is the error associated with region $i$, $W_i$ is the filter
coefficient matrix, and $Y_n$ is the tapped-delay line of the noisy
vectors. Expanding these matrices we get



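$$W_i^T = [\, A_{i,-p} \;\cdots\; A_{i,-1} \;\; A_{i,0} \;\; A_{i,1} \;\cdots\; A_{i,p} \;\; b_i \,] \qquad (2)$$

$$Y_n^T = [\, y_{n-p}^T \;\cdots\; y_{n-1}^T \;\; y_n^T \;\; y_{n+1}^T \;\cdots\; y_{n+p}^T \;\; 1 \,] \qquad (3)$$
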
Figure 1: Multidimensional transversal filter for cluster i.

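The tapped-delay line can be illustrated with a short sketch. It assumes that the tap input vector stacks the 2p+1 noisy frames centered on frame n and appends a constant 1 for the additive term, which matches the (2p+1)L+1 dimension listed in Table 1. The function name and the toy data are hypothetical.

```python
import numpy as np

def tap_input_vector(y, n, p):
    """Stack frames y[n-p..n+p] and a trailing 1 into one vector.

    y is an (N, L) array of noisy feature frames; the result has
    length (2p+1)*L + 1, the dimension listed for the tap input vector.
    """
    if n < p or n + p >= len(y):
        raise ValueError("frame n needs p neighbours on each side")
    window = y[n - p : n + p + 1]          # shape (2p+1, L)
    return np.concatenate([window.ravel(), [1.0]])

# Toy data: N=7 frames of L=2 features (values are illustrative only).
y = np.arange(14, dtype=float).reshape(7, 2)
Y3 = tap_input_vector(y, n=3, p=1)         # frames 2, 3, 4, then the constant 1
```

Frames closer than p to either end of an utterance have no complete window, which is why the sums in the training equations run only over frames with p neighbors on each side.
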
The conditional error in each region is defined as

$$E_i = \sum_{n=p}^{N-1-p} \|e_{n,i}\|^2 \, p(g_i \mid z_n) \qquad (4)$$

where $p(g_i \mid z_n)$ is the probability that the clean vector $x_n$ belongs
to region $g_i$ given an arbitrary conditioning noisy feature vector
$z_n$. Note that the conditioning noisy feature can be any acoustic
vector generated from the noisy speech frame. For example, it
may include an estimate of the signal-to-noise ratio (SNR),
energy, cepstral energy, cepstrum, etc.

The conditional probability density functions $p(z_n \mid g_i)$ are
modeled as a mixture of $I$ Gaussian distributions. Each Gaussian
distribution models a VQ region. The parameters of the distributions
(mean vectors and covariance matrices) are estimated using
the corresponding conditioning vectors associated with that region. The
posterior probabilities $p(g_i \mid z_n)$ are computed using Bayes' theorem,
and the mixture weights, $p(g_i)$, are estimated using the relative
number of training clean vectors that are assigned to a
given VQ region.
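A minimal sketch of the posterior computation follows. It assumes diagonal covariance matrices (as used in the experiments of Section 3.2) and works in the log domain for numerical stability. The function names and the toy parameters are hypothetical.

```python
import numpy as np

def region_posteriors(z, means, variances, priors):
    """Return p(g_i | z) for one conditioning vector z.

    means, variances: (I, M) arrays (diagonal covariances assumed);
    priors: (I,) mixture weights p(g_i) that sum to 1.
    """
    # Log of each diagonal Gaussian density p(z | g_i).
    log_lik = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (z - means) ** 2 / variances,
        axis=1,
    )
    log_joint = log_lik + np.log(priors)
    log_joint -= log_joint.max()           # guard against underflow
    post = np.exp(log_joint)
    return post / post.sum()               # Bayes' theorem normalisation

def priors_from_assignments(labels, num_regions):
    """Mixture weights as the fraction of clean training vectors per VQ region."""
    counts = np.bincount(labels, minlength=num_regions)
    return counts / counts.sum()

# Toy example with I=2 regions and M=2 (illustrative values only).
labels = np.array([0, 0, 0, 1])
priors = priors_from_assignments(labels, 2)
means = np.array([[0.0, 0.0], [4.0, 4.0]])
variances = np.ones((2, 2))
post = region_posteriors(np.array([0.1, -0.2]), means, variances, priors)
```

Because the posteriors are soft, a frame near a region boundary contributes to the statistics of several regions rather than to a single one.
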

To compute the optimum filters in the mean-squared error
sense, we minimize the conditional error in each VQ region. The
minimum mean-squared error solution is obtained by taking the
gradient of $E_i$, defined in Eq. (4), with respect to the filter coefficient
matrix and equating all the elements of the gradient matrix
to zero. As a result, the optimum filter coefficient matrix has the
form $W_i = R_i^{-1} r_i$, where

$$R_i = \sum_{n=p}^{N-1-p} Y_n Y_n^T \, p(g_i \mid z_n) \qquad (5)$$

is a probabilistic nonsingular auto-correlation matrix, and

$$r_i = \sum_{n=p}^{N-1-p} Y_n x_n^T \, p(g_i \mid z_n) \qquad (6)$$

is a probabilistic cross-correlation matrix.
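The training step amounts to one weighted least-squares problem per region. The sketch below assumes that the tap input vector is the stack of 2p+1 noisy frames plus a constant 1 and that the posteriors have already been computed. The function name, array layout, and synthetic data are hypothetical.

```python
import numpy as np

def train_pof_filters(x, y, post, p):
    """Estimate one filter matrix W_i per region by weighted least squares.

    x, y: (N, L) clean and noisy frames; post: (N, I) posteriors p(g_i | z_n).
    Returns an array of shape (I, (2p+1)*L + 1, L).
    """
    N, L = x.shape
    frames = range(p, N - p)               # n = p, ..., N-1-p
    # Tap input vectors Y_n: 2p+1 stacked noisy frames plus a constant 1.
    Y = np.array([np.append(y[n - p : n + p + 1].ravel(), 1.0) for n in frames])
    X = x[p : N - p]
    filters = []
    for i in range(post.shape[1]):
        w = post[p : N - p, i]             # weights p(g_i | z_n)
        R = (Y * w[:, None]).T @ Y         # probabilistic auto-correlation
        r = (Y * w[:, None]).T @ X         # probabilistic cross-correlation
        filters.append(np.linalg.solve(R, r))   # W_i = R_i^{-1} r_i
    return np.stack(filters)

# Synthetic check: clean = 2 * noisy + 1, a single region, no delay taps.
rng = np.random.default_rng(0)
y = rng.normal(size=(50, 2))
x = 2.0 * y + 1.0
W = train_pof_filters(x, y, post=np.ones((50, 1)), p=0)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly. It still requires the auto-correlation matrix to be nonsingular, as the text assumes.
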

The algorithm can be completely trained without supervision
and requires no additional information other than the simultaneous
waveforms.



The run-time estimate of the clean feature vector can be
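computed by integrating the outputs of all the filters as follows:

$$\hat{x}_n = \sum_{i} W_i^T Y_n \, p(g_i \mid z_n) \qquad (7)$$

where the sum runs over all $I$ VQ regions.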
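A sketch of this combination step is shown below. It assumes that each filter output is weighted by the posterior probability of its region for the current frame and that the weighted outputs are summed, so no hard region decision is made. The function name and the two toy filters are hypothetical.

```python
import numpy as np

def pof_estimate(Y_n, filters, post_n):
    """Combine all filter outputs, weighted by p(g_i | z_n).

    Y_n: tap input vector of length D; filters: (I, D, L) array of W_i;
    post_n: (I,) posteriors for the current frame.
    """
    outputs = np.einsum("idl,d->il", filters, Y_n)   # row i is W_i^T Y_n
    return post_n @ outputs                          # soft combination

# Two hypothetical regions with L=1 and p=0, so Y_n = [y_n, 1].
filters = np.array([[[1.0], [0.0]],    # region 0: x = y
                    [[2.0], [1.0]]])   # region 1: x = 2y + 1
Y_n = np.array([3.0, 1.0])
x_hat = pof_estimate(Y_n, filters, post_n=np.array([0.25, 0.75]))
```

With these toy values the two filters output 3 and 7, and the posterior-weighted estimate lies between them.
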





Symbol        Dimension                     Description

$n$           1                             frame index
$i$           1                             region index
$L$           1                             feature vector size
$M$           1                             conditioning feature vector size
$N$           1                             number of training frames
$I$           1                             number of VQ regions
$p$           1                             maximum filter delay
$e_{n,i}$     L x 1                         estimation error vector
$x_n$         L x 1                         clean feature vector
$\hat{x}_n$   L x 1                         estimate of clean feature vector
$y_n$         L x 1                         noisy feature vector
$z_n$         M x 1                         conditioning noisy feature vector
$\mu_i$       M x 1                         mean vector of Gaussian i
$\Sigma_i$    M x M                         covariance matrix of Gaussian i
$W_i$         ((2p+1)L+1) x L               transversal filter coefficient matrix
$Y_n$         ((2p+1)L+1) x 1               tap input vector
$A_{i,k}$     L x L                         multiplicative tap matrix
$b_i$         L x 1                         additive tap matrix
$R_i$         ((2p+1)L+1) x ((2p+1)L+1)     auto-correlation matrix
$r_i$         ((2p+1)L+1) x L               cross-correlation matrix

Table 1. List of symbols.



3. EXPERIMENTS 
3.1. Introduction 

In this section we present a series of experiments that show 
how the mapping algorithm can be used in a continuous speech 
recognizer across acoustic environments. In all of the experi- 
ments the recognizer models are trained with data recorded with 
high-quality microphones and digitally sampled at 16,000 Hz. 
The analysis frame rate is 100 Hz. 

The tables below show three types of performance indica- 
tors: 

• Relative distortion measure. For a given component of a fea- 
ture vector we define the relative distortion between the 
clean and noisy data as follows: 

$$d = \sqrt{\frac{E[(x - y)^2]}{\mathrm{var}(x)}} \qquad (8)$$

where $x$ and $y$ denote the clean and noisy values of that component.

• Word recognition error.

• Error ratio. The error ratio is given by $E_n / E_c$, where $E_n$ is
the word recognition error for the test-noisy/train-clean condition,
and $E_c$ is the word recognition error of the test-clean/
train-clean condition.
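The first and third indicators can be sketched in a few lines. The sketch assumes that the relative distortion of Eq. (8) is the square root of the mean squared difference between the clean and noisy values of a component, normalized by the variance of the clean values. Function names and sample values are hypothetical.

```python
import numpy as np

def relative_distortion(x, y):
    """sqrt(E[(x - y)^2] / var(x)) for one feature component."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.mean((x - y) ** 2) / np.var(x))

def error_ratio(noisy_error, clean_error):
    """Word error under mismatch divided by word error in the matched clean case."""
    return noisy_error / clean_error

clean = np.array([0.0, 2.0, 4.0, 6.0])   # illustrative values, not from the paper
noisy = clean + 1.0                      # constant offset of one unit
d = relative_distortion(clean, noisy)
ratio = error_ratio(20.0, 10.0)
```

A distortion of 0 means the noisy component equals the clean one, and a value near 1 means the mismatch is as large as the natural spread of the clean component.
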

3.2. Single Microphone 

To test the POF algorithm on a single target acoustic environment
we used the DARPA Wall Street Journal database [15]
on SRI's DECIPHER™ phonetically tied-mixture speech recognition
system [2]. The signal processing consisted of a filterbank-based
front-end that generated six feature streams: cepstrum (c1-c12),
cepstral energy (c0), and their first- and second-order derivatives.
Cepstral-mean normalization [16] was used to equalize
the channel. We used simultaneous recordings of high-quality
speech (Sennheiser 414 head-mounted microphone with a noise-cancelling
element) along with speech recorded by a standard
speaker phone (AT&T 720) and transmitted over local telephone
lines. We will refer to this stereo data as clean and noisy speech,
respectively. The models of the recognizer were trained using 42
male WSJ0 training talkers (3500 sentences) recorded with a
Sennheiser microphone. The models of the mapping algorithm
were trained using 240 development training sentences recorded
by three speakers. The test set consisted of 100 sentences (not
included in the training set) recorded by the same three speakers.

In this experiment we mapped two of the six features: the
cepstrum (c1-c12) and the cepstral energy (c0), separately. The
derivatives were computed from the mapped vectors of the cepstral
features. For the conditioning feature we used a 13-dimensional
cepstral vector (c0-c12) modeled with 512 Gaussians with
diagonal covariance matrices. The results are shown in Table 2.



Filter coefficients                         Average       Recognition   Error
Multiplicative                  Additive    distortion    error (%)     ratio

No mapping                                  0.72          27.6          2.46
-                               $b_i$       0.62          18.1          1.62
$A_{i,0}$                       $b_i$       0.57          17.0          1.52
$A_{i,-1} \ldots A_{i,1}$       $b_i$       0.51          17.3          1.54
$A_{i,-2} \ldots A_{i,2}$       $b_i$       0.50          16.4          1.46
$A_{i,-3} \ldots A_{i,3}$       $b_i$       0.49          15.9          1.42
$A_{i,-4} \ldots A_{i,4}$       $b_i$       0.49          16.1          1.44

Table 2. Performance of the POF algorithm for different
numbers of filter coefficients. The number of Gaussian
distributions is 512 per feature and the conditioning feature is a
13-dimensional cepstral vector.



The baseline experiment produced a word error rate of
27.6% on the noisy test set, that is, 2.46 times the error obtained
when using the clean data channel. A 34% improvement in recognition
performance was obtained when using only the additive
filter coefficient $b_i$. (Recognition error goes down to 18.1%.) The
best result (15.9% recognition error) was obtained for the condition
$p=3$, in which six neighboring noisy frames are used
to estimate the feature vector for the current frame. The correlation
between the recognition error and the average relative distortion
between the six clean and noisy features is 0.9.
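The figures quoted above can be cross-checked against Table 2 with simple arithmetic. The sketch below uses only numbers reported in the text and the table. The clean-channel error is not stated directly, so it is inferred here from the baseline error and its error ratio.

```python
baseline = 27.6        # word error (%) with no mapping
additive_only = 18.1   # word error (%) using only the additive coefficient
best = 15.9            # word error (%) at p = 3
baseline_ratio = 2.46  # error ratio of the unmapped system

def relative_reduction(before, after):
    """Relative error reduction in percent."""
    return 100.0 * (before - after) / before

# Clean-channel error implied by the baseline error and its error ratio.
clean_error = baseline / baseline_ratio

gain_additive = round(relative_reduction(baseline, additive_only), 1)
gain_best = round(relative_reduction(baseline, best), 1)
best_ratio = round(best / clean_error, 2)
print(gain_additive, gain_best, round(clean_error, 1), best_ratio)
```

The additive-only reduction rounds to the 34% quoted in the text, and the error ratio recomputed for p=3 agrees with the 1.42 listed in Table 2.
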

3.3. Multiple Microphones 

To test the performance of the POF algorithm on multiple
microphones we used SRI's stereo-ATIS database. (See the companion
paper [1] for details.) In this database, we recorded the
clean channel with a Sennheiser microphone and the noisy channel
with 10 different telephone handsets. For this set of experiments
we also mapped the cepstrum vector (c1-c12) and the
cepstral energy (c0). The maximum delay of the filters was kept
fixed at p=2, and the number of Gaussians was 512. We tried the
following conditioning features:

• Cepstrum. Same conditioning feature used in the single
microphone experiment (c0-c12).

• Spectral SNR. This is an estimate of the instantaneous sig- 
nal-to-noise ratio computed on the log-filterbank energy 
domain. The vector size is 25. 

• Cepstral SNR. This feature is generated by applying the dis- 
crete cosine transform (DCT) to the spectral SNR. The trans- 
formation reduces the dimensionality of the vector from 25 
to 12 elements. 
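The cepstral SNR feature can be sketched as a truncated DCT of the 25-channel spectral SNR vector. The text does not state the DCT type, the normalization, or which 12 coefficients are kept. This sketch assumes an unnormalized DCT-II and drops the zeroth coefficient by default. Both choices, along with the function name, are assumptions.

```python
import numpy as np

def cepstral_snr(spectral_snr, num_coeffs=12, skip_zeroth=True):
    """Project a log-filterbank SNR vector onto DCT-II basis functions."""
    s = np.asarray(spectral_snr, dtype=float)
    K = len(s)
    start = 1 if skip_zeroth else 0
    k = np.arange(start, start + num_coeffs)[:, None]   # output indices
    m = np.arange(K)[None, :]                           # filterbank channels
    basis = np.cos(np.pi * k * (m + 0.5) / K)           # unnormalised DCT-II
    return basis @ s

flat = np.full(25, 10.0)          # same SNR in all 25 channels (illustrative)
c = cepstral_snr(flat)            # all non-zeroth coefficients vanish
```

A spectrally flat SNR maps to zero in every retained coefficient, so under these assumptions the feature encodes only how the SNR varies across frequency.
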

The results are shown in Table 3. The baseline result is a
19.4% word error rate. This result is achieved when the same
wide-band front-end is used for training the models with clean
data and for recognition using telephone data. When a telephone
front-end [1] is used for training and testing, the error decreases
to 9.7%. The disadvantage of using this approach is that the
acoustic models of the recognizer have to be re-estimated. However,
the POF-based front-end operates on the clean models and
results in better performance. The cepstral SNR produces the
best result (8.7%). With this conditioning feature we combine the
effects of noise and spectral shape in a compact representation.





Experiment                          Word error (%)    Error ratio

Wide-band front-end                 19.4              2.49
Telephone-bandwidth front-end       9.7               1.24
Mapping with cepstrum               9.4               1.20
Mapping with spectral SNR           8.9               1.14
Mapping with cepstral SNR           8.7               1.11

Table 3. Performance for the multiple-telephone-handset test set.

3.4. Using POF in Either Training or Testing

The POF mapping can be applied to either the training
data or the testing data. When applied to the training data, it
makes the clean speech features look like the noisy speech features.
During recognition, the standard signal processing of the
noisy speech features may be used.

When applied to the testing data, it makes the noisy
speech features look like the clean speech features. Training of
the HMM acoustic models uses the standard signal processing.



Signal processing                               Word          Error
Training                Testing                 error (%)     ratio

Standard                Standard                31.4          2.7
Map clean to noisy      Standard                21.3          1.8
Standard                Map noisy to clean      20.0          1.7

Table 4. Training and testing paradigms using the Probabilistic
Optimum Filter. Word error is on the AT&T speaker phone.



The results in Table 4 show that similar performance
is obtained when using the mapping either in training (21.3%) or
in testing (20.0%). In both cases, this is a significant decrease in
word error relative to the system without compensation (31.4%). The
recognition numbers are slightly different from those in Table 2
since this experiment uses an earlier version of the mapping and
recognizer.

4. CONCLUSIONS 

We have presented a feature mapping algorithm capable of 
exploiting nonlinear relations between two acoustic spaces. We 
have shown how to improve the performance of the recognizer in 
the presence of a noisy signal by using a small database with 
simultaneous recordings in the clean and noisy acoustic environ- 
ments. 

The mapping algorithm has performed well on a speaker-dependent/
single-microphone task and on a speaker-independent/
multiple-microphone task. In both cases the target acoustic
environment was known a priori. The POF algorithm efficiently
exploited the correlations within and between frames, resulting
in significant improvements over the unmapped systems.

The POF algorithm can be used only when a stereo data- 
base containing the clean and noisy speech is available. This 
requirement limits the use of the POF algorithm to applications 
in which the target acoustic environment is well defined and sta- 
ble. These applications may include those for which the micro- 
phone, the channel or the background noise encountered in the 
field do not match the training conditions. 